ABSTRACT

Multivariate statistical techniques are widely used in basic sciences research. These techniques, such as
classification and data reduction methods, rely mainly on two estimators, location and scatter. The sample mean and
covariance matrix are commonly used as these estimators, but they are extremely sensitive to elliptical distributions with
heavy tails. In this context, many robust alternatives have been established. The accuracy of these robust estimators depends
mainly on the 'h' data points selected out of n. This paper suggests a robust procedure for selecting the 'h' data points in
order to obtain reliable estimates. The efficiency of the proposed procedure is demonstrated by applying it to a
classification method on real data.
INTRODUCTION

In statistics, the location and scatter parameters play a vital role in all multivariate statistical techniques.
The conventional estimates of location and scatter are the sample mean and covariance matrix, which are very sensitive to
outliers. In this context, many robust alternatives have been established during the past decades. The basic multivariate
robust estimators are the minimum volume ellipsoid (MVE) and the minimum covariance determinant (MCD), both proposed by
Rousseeuw (1985). These methods compute the parameters by finding h data points out of the n points in the given data set.
Hence, the estimated parameters rely only on these h data points, and it is necessary to select suitable h data points in
order to compute the parameters. There is no computationally reasonable group-wise best point selection procedure for
calculating the MCD and MVE.

The MVE and MCD of a given data set are determined by a subset of h points, subject to the constraint that the
ellipsoid covering the points has minimum volume (MVE), or that the covariance matrix has minimum determinant (MCD),
among all those constructed using h points (Rousseeuw (1985), Hawkins (1993) and Woodruff and Rocke (1993)). The size of
the subset is a function of the number of data points n and the dimensionality p, and is chosen to give an estimate with a
breakdown point of 50%.

In this context, a best point selection (subset selection) procedure based on MVE and MCD is proposed.
The classical maximum likelihood estimator (MLE) and the robust estimators MVE and MCD, along with the
Feasible Solution Algorithm (FSA), are described in section 2. The proposed procedure and its computational steps
are presented in section 3. The proposed procedure is assessed against the other procedures on real data, and
the results are summarized in section 4. A summary of the findings is presented in the last section.
Classical and Robust Estimators

The theoretical aspects of the classical and robust estimators, MLE, MVE and MCD along with FSA are briefly 
furnished in this section. 

Maximum Likelihood Estimator 

The principle of maximum likelihood estimation was originally developed by Professor R. A. Fisher in 1920.
The standard estimates are the maximum likelihood estimates or their unbiased variants. The sample mean vectors, which
provide the estimates of $\mu_i$, are given by

$$\hat{\mu}_i = \bar{Y}_i = \sum_{j=1}^{n_i} Y_{ij}/n_i, \quad (i = 1, \ldots, g) \qquad (1)$$

For the heteroscedastic case, each $\Sigma_i$ is estimated by its training sample analogue, usually after correction for
bias, to give

$$\hat{\Sigma}_i = S_i = \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)(Y_{ij} - \bar{Y}_i)^T/(n_i - 1), \quad (i = 1, \ldots, g) \qquad (2)$$

For the homoscedastic discriminant analysis model, the standard estimate of the common covariance matrix $\Sigma$ is
the pooled within-group sample covariance matrix

$$\hat{\Sigma} = S = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)(Y_{ij} - \bar{Y}_i)^T/(n - g), \qquad (3)$$

where $n = \sum_{i=1}^{g} n_i$ is the total sample size across groups.
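The estimates in equations (1) to (3) can be illustrated with a minimal NumPy sketch. The function name `group_estimates` and the toy data are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def group_estimates(Y, labels):
    """Per-group means (eq. 1), per-group covariances (eq. 2), pooled covariance (eq. 3)."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels).tolist()
    n, p = Y.shape
    means, covs = {}, {}
    pooled = np.zeros((p, p))
    for g in groups:
        Yg = Y[labels == g]
        mean_g = Yg.mean(axis=0)               # equation (1)
        dev = Yg - mean_g
        scatter = dev.T @ dev                  # sum of outer products of deviations
        means[g] = mean_g
        covs[g] = scatter / (len(Yg) - 1)      # equation (2), bias-corrected
        pooled += scatter
    pooled /= n - len(groups)                  # equation (3), divisor n - g
    return means, covs, pooled

# Hypothetical toy data: two groups of three bivariate observations.
Y = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 7.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
means, covs, pooled = group_estimates(Y, labels)
```

With equal group sizes, as in this toy example, the pooled matrix is simply the average of the per-group covariance matrices.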

Minimum Volume Ellipsoid (MVE) 

The minimum volume ellipsoid estimator was proposed by Rousseeuw (1985). The MVE estimator is a very
powerful procedure for estimating location and scatter. It is known to have a breakdown point that approaches 50%
as the number of points in the data set increases. This is the maximum possible breakdown point, and it means that
approximately half of the data can be arbitrarily contaminated without arbitrarily affecting the estimate. The
computational steps are as follows:


Let

$$(x - c)^T \Gamma^{-1} (x - c) = p, \qquad (4)$$


where $c$ and $\Gamma$ are the location vector and scatter matrix respectively, and $p$ is the dimension of the data. The
location vector is the weighted mean, calculated as

$$c = \sum_{i=1}^{h} w_i x_i^{*} \qquad (5)$$


and the covariance or scatter matrix is

$$\Gamma = \sum_{i=1}^{h} w_i (x_i^{*} - c)(x_i^{*} - c)^T, \qquad (6)$$

where $x_i^{*}$ is a column vector denoting the $i$-th observation of the subset of $h$ points, $w_i$ is the weight for the
$i$-th observation, and $h = [(n + p + 1)/2]$ ($[\cdot]$ denotes the greatest integer function). The volume of the covering
ellipsoid is proportional to the square root of the determinant of $\Gamma$. It is evident from these equations that, to find
the MVE, one must determine which $h$ points should be covered and the corresponding weights that ensure coverage of the
points.
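As a small illustration of equations (5) and (6), the following sketch computes the subset size h, the weighted location and the weighted scatter for a given subset and given weights. It does not solve the MVE optimisation itself. The function names, the equal weights and the toy points are assumptions made for illustration.

```python
import numpy as np

def subset_size(n, p):
    """h = [(n + p + 1) / 2], with [.] the greatest integer function."""
    return (n + p + 1) // 2

def weighted_location_scatter(X_sub, w):
    """Location c (equation 5) and scatter Gamma (equation 6) for a given subset."""
    X_sub = np.asarray(X_sub, dtype=float)
    w = np.asarray(w, dtype=float)
    c = w @ X_sub                          # weighted mean of the subset
    dev = X_sub - c
    gamma = (dev * w[:, None]).T @ dev     # weighted sum of outer products
    return c, gamma

# Hypothetical subset of four bivariate points with equal weights summing to one.
X_sub = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
w = np.full(len(X_sub), 1.0 / len(X_sub))
c, gamma = weighted_location_scatter(X_sub, w)
criterion = np.linalg.det(gamma)           # determinant used as the volume criterion
```

For example, n = 150 and p = 4 give h = 77. Comparing `criterion` across candidate subsets is what makes an exhaustive search expensive, since the number of subsets grows combinatorially with n.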

Minimum Covariance Determinant (MCD) Estimators 

Rousseeuw (1985) introduced the minimum covariance determinant (MCD) estimator to estimate the mean vector
and covariance matrix, along with the detection of outliers, in multidimensional data. Multivariate location and dispersion
estimation under the high breakdown principle is based on the determinant of the covariance matrix. For an n x p data
matrix, the p x p covariance matrix is positive semi-definite, its p eigenvalues are non-negative, and the determinant of
the covariance matrix equals the product of these eigenvalues. The MCD approach considers all $\binom{n}{h}$ subsets of
size h and computes the determinant of the covariance matrix for each subset. The subset with the smallest determinant is
used to calculate the usual p x 1 mean vector and the corresponding p x p covariance matrix; these estimators are called
the minimum covariance determinant estimators.
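The exhaustive definition above can be sketched directly for very small data sets. The function name `mcd_exhaustive` and the toy points are assumptions made for illustration; the search visits every subset of size h, so it is only practical when n is tiny.

```python
from itertools import combinations

import numpy as np

def mcd_exhaustive(X, h):
    """Search all subsets of size h and keep the one with the smallest covariance determinant."""
    X = np.asarray(X, dtype=float)
    best_det, best_idx = np.inf, None
    for idx in combinations(range(len(X)), h):
        S = np.cov(X[list(idx)], rowvar=False)   # p x p sample covariance of the subset
        det = np.linalg.det(S)
        if det < best_det:
            best_det, best_idx = det, idx
    sub = X[list(best_idx)]
    return sub.mean(axis=0), np.cov(sub, rowvar=False), best_idx

# Hypothetical data: five clustered points and one clear outlier (index 5).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [0.5, 0.4], [10.0, -7.0]])
h = (len(X) + X.shape[1] + 1) // 2               # h = [(n + p + 1) / 2] = 4
mean, cov, idx = mcd_exhaustive(X, h)
```

Here only 15 subsets are examined, and the outlier is excluded from the selected subset. For realistic n the number of subsets is far too large, which motivates heuristic searches such as the FSA.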

Hawkins (1994) established a feasible solution algorithm (FSA) for the MCD approach; a brief description follows.
The FSA starts from a trial partition of the cases into trimmed and retained cases and seeks to reduce the determinant of
the covariance matrix by pairwise case swaps. It evaluates the pairwise exchanges between a retained case and a trimmed
case, and carries out an exchange that leads to a reduction in the covariance determinant. It is clear from the definition
that the MCD estimator for $\mu$ and $\Sigma$ is the sample mean vector and covariance matrix of a subset of size $h$ for
which the determinant of $\Sigma$ cannot be decreased by exchanging any one of the trimmed cases for one of the retained
cases.
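A minimal sketch of the pairwise-swap idea is given below. It assumes a single random starting partition and, in each pass, carries out the swap that gives the largest reduction in the determinant; these choices, the function names and the toy data are illustrative assumptions and not necessarily the exact scheme of Hawkins (1994).

```python
import numpy as np

def cov_det(X, idx):
    """Determinant of the sample covariance matrix of the rows in idx."""
    return np.linalg.det(np.cov(X[idx], rowvar=False))

def fsa_mcd(X, h, seed=0):
    """Swap retained and trimmed cases until no single swap lowers the determinant."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    rng = np.random.default_rng(seed)
    retained = rng.choice(n, size=h, replace=False).tolist()   # trial partition
    trimmed = [i for i in range(n) if i not in retained]
    best_det = cov_det(X, retained)
    while True:
        best_swap = None
        for a in range(len(retained)):
            for b in range(len(trimmed)):
                trial = retained.copy()
                trial[a] = trimmed[b]                          # exchange one pair of cases
                d = cov_det(X, trial)
                if d < best_det:
                    best_det, best_swap = d, (a, b)
        if best_swap is None:                                  # no improving swap is left
            break
        a, b = best_swap
        retained[a], trimmed[b] = trimmed[b], retained[a]
    sub = X[retained]
    return sub.mean(axis=0), np.cov(sub, rowvar=False), sorted(retained)

# Hypothetical data: five clustered points and one clear outlier (index 5).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [0.5, 0.4], [10.0, -7.0]])
mean, cov, kept = fsa_mcd(X, h=4)
```

The loop stops at a subset for which no single exchange lowers the determinant, which is a local rather than a guaranteed global optimum; in practice several random starts would be tried.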

Best Points Selection Procedure (BPS) 

The aim of the MVE and MCD procedures is to find h points out of n, the total number of cases across all the groups
considered. The main limitation of these methods is that a large number of subsets of h points must be extracted from the
n points. For voluminous data, it is computationally very difficult to select the h points out of n that satisfy
the criteria of minimum volume and minimum determinant. The proposed Best Points Selection (BPS) algorithm, based on
MVE and MCD, finds the weighted mean and weighted covariance matrix from the selected h points. The
proposed BPS procedure is summarized below.

First, select the h points in each group separately by either the MVE or the MCD procedure, and then compute the
location and scatter for all of these h points combined, giving them equal weights. Next, compute the Mahalanobis
distance for all observations based on the computed weighted location and scatter. Arrange the distances in ascending
order; the first p+j (j = 1, 2, ..., h-p) distances are selected and their corresponding sample units are used to compute
the next $\bar{x}$ and $S$. Repeat the same procedure until h data points are selected. Then compute the location and
scatter for the selected h points. The general description of the computational algorithm is as follows.

Step 1: Let $X = (x_1, x_2, \ldots, x_m)$ be a set of m points in $\mathbb{R}^p$, and let h be a natural number such that

$$m/2 < h = \begin{cases} g_1 + g_2 + \cdots + g_n \\ ([m + p + 1]/2) \end{cases} < m$$

Step 2: The p + 1 data points $\{x_1, x_2, \ldots, x_{p+1}\}$ that satisfy the two optimality criteria are selected and
used to obtain the location vector and scatter matrix

$$\bar{x} = \frac{1}{p+1} \sum_{i=1}^{p+1} w_i x_i \quad \text{and} \quad S_{jk} = \frac{1}{p+1} \sum_{i=1}^{p+1} w_i (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$


Step 3: Compute the Mahalanobis distance for all the observations using $\bar{x}$ and $S$, as $d_i^2 = (x_i - \bar{x})^T S^{-1} (x_i - \bar{x})$.

Step 4: The $d_i^2$ $(i = 1, 2, \ldots, k)$ are arranged in order of magnitude from the least to the highest. The first
p+j (j = 2, 3, 4, ..., h-p) distances are selected and their corresponding sample units are used to compute the next
$\bar{x}$ and $S$ as follows

$$\bar{x} = \frac{1}{p+j} \sum_{i=1}^{p+j} w_i x_i \quad \text{and} \quad S_{jk} = \frac{1}{p+j} \sum_{i=1}^{p+j} w_i (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$

The new $\bar{x}$ and $S$ are then used to obtain the Mahalanobis distances for all the observations.


Step 5: Steps 3 and 4 are repeated until the number of units selected is

$$h = \begin{cases} g_1 + g_2 + \cdots + g_n \\ ([m + p + 1]/2) \end{cases}$$

Finally, the location and scatter based on the BPS algorithm can be computed as

$$\bar{x}_{(BPS)} = \frac{1}{h} \sum_{i=1}^{h} w_i x_i \quad \text{and} \quad S_{jk(BPS)} = \frac{1}{h} \sum_{i=1}^{h} w_i (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$
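Steps 2 to 5 can be sketched as a forward selection loop. The sketch assumes equal weights of one, a scatter matrix divided by the current subset size, and a starting set of p + 1 indices supplied by the caller (in the procedure these would come from MVE or MCD). The function names and toy data are hypothetical.

```python
import numpy as np

def mahalanobis_sq(X, center, S):
    """Squared Mahalanobis distance of every row of X from center (Step 3)."""
    dev = X - center
    return np.einsum("ij,ij->i", dev @ np.linalg.inv(S), dev)

def best_points_selection(X, start_idx, h):
    """Grow the selected subset one point at a time until it holds h points."""
    X = np.asarray(X, dtype=float)
    idx = np.asarray(start_idx)
    size = len(idx)                               # p + 1 starting points (Step 2)
    while True:
        sub = X[idx]
        center = sub.mean(axis=0)                 # equal-weight location
        dev = sub - center
        S = dev.T @ dev / len(sub)                # equal-weight scatter (assumed divisor)
        if len(idx) >= h:                         # Step 5: stop once h points are selected
            break
        d2 = mahalanobis_sq(X, center, S)         # Step 3
        size += 1
        idx = np.argsort(d2)[:size]               # Step 4: keep the p + j closest points
    return center, S, np.sort(idx)

# Hypothetical data: five clustered points and one clear outlier (index 5).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [0.5, 0.4], [10.0, -7.0]])
center, S, selected = best_points_selection(X, start_idx=[0, 1, 2], h=4)
```

Because each pass costs one distance computation and one sort over all observations, the loop needs only h - p - 1 passes, in contrast to the combinatorial subset search of the exact MVE and MCD definitions.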


Experimental Results 


This section summarizes the discriminant analysis results, specifically the confusion matrix, apparent
error rate and discriminant coefficients, obtained with the classical and robust procedures along with
the proposed procedure. First, the mean vector and covariance matrix were computed and the discriminant
analysis was performed on the training data set. Second, the same discriminant function was applied to the validation
data in order to validate the function.
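The evaluation described above can be sketched as a plug-in linear discriminant rule followed by a confusion matrix and apparent error rate (AER). The sketch assumes equal prior probabilities, a common covariance matrix, confusion matrix rows indexed by the actual group, and AER defined as the overall proportion misclassified; the paper does not state these details, and all names and data below are hypothetical.

```python
import numpy as np

def linear_scores(X, means, cov):
    """Linear discriminant scores x'S^-1 m_g - 0.5 m_g'S^-1 m_g for each group g."""
    X = np.asarray(X, dtype=float)
    M = np.asarray(means, dtype=float)            # one row per group
    A = np.linalg.solve(cov, M.T)                 # columns hold S^-1 m_g
    return X @ A - 0.5 * np.sum(M.T * A, axis=0)

def classify(X, means, cov):
    """Assign each observation to the group with the highest score."""
    return np.argmax(linear_scores(X, means, cov), axis=1)

def confusion_and_aer(y_true, y_pred, n_groups):
    """Row-normalised confusion matrix and the overall proportion misclassified."""
    counts = np.zeros((n_groups, n_groups))
    for t, q in zip(y_true, y_pred):
        counts[t, q] += 1
    proportions = counts / counts.sum(axis=1, keepdims=True)
    aer = 1.0 - np.trace(counts) / counts.sum()
    return proportions, aer

# Hypothetical plug-in estimates; any location and scatter estimator could supply them.
means = np.array([[0.0, 0.0], [4.0, 4.0]])
cov = np.eye(2)
X_new = np.array([[0.0, 1.0], [5.0, 3.0], [1.0, 1.0], [3.0, 4.0]])
y_true = np.array([0, 1, 1, 1])
y_pred = classify(X_new, means, cov)
proportions, aer = confusion_and_aer(y_true, y_pred, n_groups=2)
```

Swapping the classical estimates for robust ones changes only the `means` and `cov` arguments, which is how the different procedures can be compared on the same data.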

The iris data set was considered for this experiment. The data set was divided into two parts: 60% of
the data were used as training data and the remaining 40% as validation data. The results obtained for
the training data under the various procedures are summarized in Table 1.


Table 1: Results of Discriminant Analysis under the Various Procedures (Training Data)

Discriminant function coefficients, given as (LD1, LD2) for each row of the coefficient matrix:

Row   MLE              MVE              MCD              FSA              BPS_MVE          BPS_MCD
1     (0.58, -1.26)    (0.12, -1.70)    (1.28, 0.76)     (0.72, 0.97)     (-1.08, 0.61)    (-0.34, 1.68)
2     (1.90, 3.07)     (-1.90, 3.40)    (4.33, 1.83)     (6.50, 0.92)     (5.64, -0.76)    (4.59, -1.43)
3     (-1.77, 0.32)    (-1.07, 0.17)    (-8.82, 5.35)    (-0.44, 0.94)    (-0.43, -4.37)   (0.87, 0.77)
4     (-3.52, 1.41)    (-6.00, 2.26)    (-6.00, 2.26)    (-1.27, -0.78)   (-9.66, -9.21)   (-2.38, -6.50)

Confusion matrix, one matrix row per line:

Row   MLE              MVE              MCD              FSA              BPS_MVE          BPS_MCD
1     1.00 0.00 0.00   1.00 0.00 0.00   1.00 0.00 0.00   1.00 0.00 0.00   1.00 0.00 0.00   1.00 0.00 0.00
2     0.00 0.97 0.03   0.00 0.97 0.03   0.00 0.97 0.03   0.00 0.97 0.03   0.00 1.00 0.00   0.00 1.00 0.00
3     0.00 0.00 1.00   0.00 0.03 0.97   0.00 0.00 1.00   0.00 0.10 0.90   0.00 0.00 1.00   0.00 0.00 1.00

AER   0.01             0.02             0.01             0.04             0.00             0.00


It is observed that the classification of the training data varies with the procedure. The proposed Best Points
Selection procedure, under both MVE and MCD, classified the given observations more accurately than
the other procedures. The discriminant functions generated under the various procedures were then used to classify
the remaining data, which were treated as validation data, and the confusion matrix and apparent error rate were computed.
The results are summarized in Table 2.


Table 2: Results of Discriminant Analysis under the Various Procedures (Validation Data)

Confusion matrix, one matrix row per line:

Row   MLE              MVE              MCD              FSA              BPS_MVE          BPS_MCD
1     1.00 0.00 0.00   1.00 0.00 0.00   1.00 0.00 0.00   1.00 0.00 0.00   1.00 0.00 0.00   1.00 0.00 0.00
2     0.00 0.95 0.05   0.00 0.95 0.05   0.00 0.95 0.05   0.00 1.00 0.00   0.00 0.97 0.03   0.00 1.00 0.00
3     0.00 0.00 1.00   0.00 0.05 0.95   0.00 0.00 1.00   0.00 0.35 0.65   0.00 0.00 1.00   0.00 0.00 1.00

AER   0.02             0.03             0.02             0.17             0.01             0.00


The discriminant function computed from the training data is validated using the validation data. Almost all
the procedures produced a higher apparent error rate for the validation data set than for the training data; that is,
the misclassification probabilities are higher than those for the training data. However, the proposed
procedure classified the validation data exactly under MCD and with a small error rate under MVE. Hence, it is concluded
that the Best Points Selection (BPS) procedure produces reliable estimates of the mean vector and covariance matrix for
the given multivariate data.

CONCLUSIONS 

This paper suggests a robust estimator of the mean vector and covariance matrix of given multivariate
data. It is demonstrated with real data that the proposed BPS procedure gives reliable results and is more efficient than
the other procedures. It is concluded that the proposed BPS procedure is applicable in all multivariate
techniques in which the mean vector and covariance matrix are to be estimated or used, and specifically in research on
high dimensional data analysis.